[Figure panels: optimization of the real-valued architecture $\alpha$ (objective $f(\alpha)$, optima $\alpha_1^*, \alpha_2^*$) and of the binary architecture $\hat{\alpha}$ (objective $\hat{f}(\hat{\alpha})$, optima $\hat{\alpha}_1^*, \hat{\alpha}_2^*$), contrasting the optimized solution with the sub-optimized solution obtained by direct binarization.]
FIGURE 4.10
Motivation for DCP-NAS. We first show that directly binarizing a real-valued architecture to 1-bit is sub-optimal. We therefore use tangent propagation (middle) to find an optimized 1-bit neural architecture along the tangent direction, which leads to a better-performing 1-bit neural architecture.
4.4 DCP-NAS: Discrepant Child-Parent Neural Architecture Search for 1-Bit CNNs
As discussed for CP-NAS above, real-valued models converge much faster than 1-bit models, as revealed in [157]. This motivates us to use the tangent direction of the Parent supernet (real-valued model) as an indicator of the optimization direction for the Child supernet (1-bit model). We assume that all possible 1-bit neural architectures can be learned from the tangent space of the Parent model, based on which we introduce the Discrepant Child-Parent Neural Architecture Search (DCP-NAS) [135] method to produce an optimized 1-bit CNN. Specifically, as shown in Fig. 4.10, instead of directly binarizing the Parent to obtain the Child, we use the Parent model to find a tangent direction and learn the 1-bit Child through tangent propagation. Since the tangent direction involves second-order information, we further accelerate the search with the generalized Gauss-Newton (GGN) matrix, leading to an efficient search process. Furthermore, a coupling relationship exists between the weights and the architecture parameters in such DARTS-based [151] methods, leading to asynchronous convergence and an insufficient training process. To overcome this obstacle, we propose a decoupled optimization for training the Child-Parent model, leading to an effective and optimized search process. The overall framework of our DCP-NAS is shown in Fig. 4.11.
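To make the role of the tangent direction concrete, the following is a heavily simplified sketch, not the objective derived later in this section: here the tangent direction is taken to be the gradient of each supernet's loss with respect to its architecture parameters, and the quadratic placeholder losses, the penalty weight lam, and the tensor shapes are all assumptions for illustration.

```python
import torch

# Toy stand-in for the Parent/Child supernets: E edges, each choosing among
# NUM_OPS candidate operations; alpha / alpha_hat are the real-valued and
# 1-bit architecture parameters. The two quadratic "losses" below are
# placeholders for the actual supernet validation losses.
torch.manual_seed(0)
E, NUM_OPS = 4, 3
alpha = torch.zeros(E, NUM_OPS, requires_grad=True)      # Parent architecture
alpha_hat = torch.zeros(E, NUM_OPS, requires_grad=True)  # Child architecture

parent_target = torch.randn(E, NUM_OPS)
child_target = torch.randn(E, NUM_OPS)

def parent_loss(a):  # placeholder Parent (real-valued) objective
    return ((torch.softmax(a, -1) - torch.softmax(parent_target, -1)) ** 2).sum()

def child_loss(a):   # placeholder Child (1-bit) objective
    return ((torch.softmax(a, -1) - torch.softmax(child_target, -1)) ** 2).sum()

lam = 0.1  # assumed weight on the tangent-discrepancy term
opt = torch.optim.SGD([alpha, alpha_hat], lr=0.1)

for step in range(200):
    opt.zero_grad()
    l_parent = parent_loss(alpha)
    # Tangent direction of the Parent w.r.t. its architecture parameters.
    g_parent = torch.autograd.grad(l_parent, alpha, retain_graph=True)[0]
    # Tangent direction of the Child; create_graph=True exposes the
    # second-order information whose cost the GGN approximation later reduces.
    g_child = torch.autograd.grad(child_loss(alpha_hat), alpha_hat,
                                  create_graph=True)[0]
    # Child loss plus a penalty that aligns the Child's tangent direction with
    # the Parent's, instead of copying (binarizing) alpha directly.
    loss = l_parent + child_loss(alpha_hat) + lam * ((g_child - g_parent) ** 2).sum()
    loss.backward()
    opt.step()
```

Aligning the Child's tangent direction with the Parent's, rather than copying $\alpha$ and binarizing it, is what distinguishes the search from direct binarization; the create_graph=True call is where the second-order cost enters, which motivates the GGN acceleration discussed later.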
4.4.1 Preliminary
Neural architecture search. Given a conventional CNN model, we denote its weights and feature maps in a specific layer as $w \in \mathcal{W}$, with $\mathcal{W} = \mathbb{R}^{C_{out} \times C_{in} \times K \times K}$, and $a_{in} \in \mathbb{R}^{C_{in} \times W \times H}$. $C_{out}$ and $C_{in}$ represent the output and input channels of the layer, $(W, H)$ are the width and height of the feature maps, and $K$ is the kernel size. Then we have
$$
a_{out} = a_{in} \otimes w, \tag{4.19}
$$
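As a quick sanity check of the shapes in Eq. (4.19), the following minimal sketch runs one such convolution in PyTorch; the concrete layer sizes are arbitrary, and F.conv2d stands in for the $\otimes$ operator.

```python
import torch
import torch.nn.functional as F

# Shapes follow Eq. (4.19): w in R^{C_out x C_in x K x K}, a_in in R^{C_in x W x H}.
# The sizes below are arbitrary examples (note PyTorch orders spatial dims as H, W).
C_in, C_out, K, W, H = 16, 32, 3, 28, 28
w = torch.randn(C_out, C_in, K, K)
a_in = torch.randn(C_in, W, H)

# F.conv2d expects a batch dimension, so the single feature map is unsqueezed;
# padding=K//2 preserves the spatial resolution.
a_out = F.conv2d(a_in.unsqueeze(0), w, padding=K // 2).squeeze(0)
print(a_out.shape)  # torch.Size([32, 28, 28]) -> R^{C_out x W x H}
```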